ABSTRACT

We present a new approach, kernel regression, to determine photometric redshifts for 399,929
galaxies in the Fifth Data Release of the Sloan Digital Sky Survey (SDSS). In our case, kernel
regression is a weighted average of the spectroscopic redshifts of the neighbors of a query
point, where higher weights are assigned to points that are closer to the query point. One
important design decision when using kernel regression is the choice of the bandwidth. We
apply 10-fold cross-validation to choose the optimal bandwidth, i.e. the bandwidth at which
the cross-validation error reaches its minimum. The experiments show that the optimal
bandwidth differs between input patterns. The lowest rms error of photometric redshift
estimation is 0.019, using color+eClass as the inputs; the next lowest is 0.020, using
ugriz+eClass as the inputs. Here eClass is a galaxy spectral type. The third lowest rms
scatter is 0.021, with color+r as the inputs. Thus, in addition to parameters such as
magnitudes and colors, eClass is a valid parameter for predicting photometric redshifts.
Moreover, the results suggest that the accuracy of photometric redshift estimation improves
when the sample is divided into early-type and late-type galaxies; for early-type galaxies
in particular, the rms scatter is 0.016 with color+eClass as the inputs. In addition, kernel
regression predicts the photometric eClass with high accuracy (σ_rms = 0.034) using color+r
as the input pattern. For kernel regression, considering more parameters does not always
give more accurate photometric redshifts; the accuracy is satisfactory only when appropriate
parameters are chosen. Kernel regression provides a comprehensible and accurate regression
model of the data. Experiments reveal the superiority of kernel regression when compared to
other empirical training approaches.

Key words: galaxies: distances and redshifts - methods: statistical



1 INTRODUCTION 

In general, the redshifts of galaxies are measured spectroscopically.
Achieving high signal-to-noise spectra requires long integration
times. For large sets of faint galaxies, however, spectra are
difficult or impractical to obtain. In the absence of spectroscopic
data, redshifts of galaxies may be estimated using medium- or
broadband photometry, which may be thought of as very low-resolution
spectroscopy. Though such photometric redshifts are necessarily less
accurate than true spectroscopic redshifts, they are nonetheless
sufficient for determining the formation and evolution properties of
large numbers of galaxies, as opposed to studying the accurate
redshift of an individual galaxy (Gwyn 1990). Photometric redshifts
may be obtained less expensively and for much larger samples than is
possible with spectroscopy. Since the nineties, photometric redshifts
have rapidly become a crucial tool in mainstream observational
cosmology. To date, photometric redshift catalogs have been used to
address several scientific issues, e.g. the evolution of the
luminosity density and the number of massive galaxies already
assembled at early epochs (Fontana et al. 2000), the evolution of
galaxy size (Poli et al. 1999; Giallongo et al. 2000), the
determination of cosmological baryonic and matter densities (Blake et
al. 2007), and the clustering of luminous red galaxies in SDSS
imaging data (Padmanabhan et al. 2007).

Techniques for deriving photometric redshifts were pioneered by Baum
(1962). Subsequent implementations of these basic techniques were
made by Couch et al. (1983) and Koo (1985). Photometric redshift
techniques fall into two broad categories: template matching methods
and empirical training-set methods. Each approach has advantages and
disadvantages. The former approach relies on fitting model galaxy
spectral energy distributions (SEDs) to the photometric data, where
the models span a range of expected galaxy redshifts and spectral
types (e.g. Sawicki, Lin & Yee 1997). A library of template spectra
(e.g. Bruzual & Charlot 1993; Coleman, Wu & Weedman 1980) is
employed, and a fit is used to obtain the optimal template pairs for
each galaxy. The various techniques of this kind differ in their
choice of template SEDs and in the fitting procedure. Template SEDs
may come from population synthesis models (e.g. Bruzual & Charlot
1993) or from spectra of real objects (e.g. Coleman, Wu & Weedman
1980). Both kinds of templates have their weaknesses: template SEDs
from population synthesis models may include unrealistic combinations
of parameters or exclude known cases, while the real galaxy templates
are almost always derived from data on bright low-redshift galaxies
and may be poor representations of the high-redshift galaxy
population (Wadadekar 2005).

The latter approach uses an existing spectroscopic redshift sample as
a training set to derive photometric redshifts as a function of the
photometric data. Typical training-set methods include artificial
neural networks (ANNs; Collister & Lahav 2004; Firth, Lahav &
Somerville 2003; Vanzella et al. 2004; Li et al. 2006), support
vector machines (SVMs; Wadadekar 2005), ensemble learning and
Gaussian process regression (Way & Srivastava 2006), and linear and
non-linear polynomial fitting (Brunner et al. 1997; Wang, Bahcall &
Turner 1998; Budavári et al. 2005; Hsieh et al. 2005; Connolly et al.
1995). The strength of such techniques is that they are constructed
automatically from the properties of galaxies in the real universe
and require no additional assumptions about their formation and
evolution. However, for empirical best-fit methods such as linear and
non-linear polynomial fitting, it is difficult to extrapolate to
objects fainter than the spectroscopic limit. For the ANN approach,
the optimal architecture is not easy to obtain, and training can
easily get stuck in local minima. Unlike ANNs, SVMs do not require a
choice of architecture before training, but considerable effort is
needed to obtain the optimal parameters of their models.

Other interpolative training-set methods are instance-based learning
techniques, which have been applied to predict photometric redshifts
(e.g. Csabai et al. 2003; Ball et al. 2007). Instance-based learning
methods base their predictions directly on (training) data stored in
memory. Usually they store all the training data in memory during the
learning phase and defer all the essential computation until the
prediction phase. Examples of such techniques are k-nearest neighbor,
kernel regression and locally weighted regression. If k is set to n
(the number of data points) and the weights are optimized by gradient
descent, k-nearest neighbor turns into kernel regression, while
locally weighted regression generalizes kernel regression, going
beyond local average values. In general, irrelevant features are
often harmful to instance-based approaches, whereas ANNs can be
trained directly on problems with hundreds or thousands of inputs. On
the other hand, instance-based learning methods can fit
low-dimensional, very complex functions very accurately, while ANNs
require considerable tweaking to do this. When new data are added,
training is almost free for instance-based learning methods, but ANNs
and SVMs need to be retrained.

We put forward a kernel regression method to estimate photometric
redshifts. This paper is organized as follows. In Section 2 we
describe the data we use. A brief overview of kernel regression is
given in Section 3. Section 4 presents the results and discussion,
and the conclusion is given in Section 5.



2 DATA 

The Sloan Digital Sky Survey (SDSS, York et al. 2000) is the most
ambitious astronomical survey ever undertaken. When completed, it
will provide detailed optical images covering more than a quarter of
the sky, and a 3-dimensional map of about a million galaxies and
quasars, using a dedicated 2.5-meter telescope located at Apache
Point, New Mexico. The first stage of SDSS is already complete (with
DR5). It has imaged 8,000 square degrees in five bandpasses (u, g, r,
i, z) and measured spectra of more than 675,000 galaxies, 90,000
quasars and 185,000 stars. In its second stage, SDSS will carry out
three new surveys in different research areas, such as the nature of
the universe, the origin of galaxies and quasars, and the formation
and evolution of the Milky Way. In order to construct a
representative sample, we collected all objects satisfying the
following criteria from SDSS Data Release 5 (Adelman-McCarthy et al.
2007). All magnitudes mentioned below are corrected for Galactic
extinction using the dust maps of Schlegel et al. (1998). After
requiring that the spectroscopic redshift confidence be greater than
or equal to 0.95 and that the redshift flags be zero, we obtained a
sample containing 399,929 galaxies.

The photometric properties discussed below are available in all five
SDSS bandpasses (ugriz); however, the r-band values of these
quantities are usually used, because the r band generally has the
lowest error and gives more consistent results (Way & Srivastava
2006). The Petrosian 50% (90%) radius is the radius within which 50%
(90%) of the flux of the object is contained. r50 is the Petrosian
50% radius in the r band and r90 is the Petrosian 90% radius in the r
band. The ratio of these quantities is called the Petrosian
concentration index, c = r90/r50, which is an indicator of galaxy
type: early-type galaxies have c > 2.5 and late-type galaxies have
c < 2.5 (Strateva et al. 2001). The Petrosian radii are also used
together with a measure of the profile type from the SDSS photometric
pipeline, named fracDeV. fracDeV results from a linear combination of
the best exponential and de Vaucouleurs profiles that are fit to the
image in each band. fracDeV is a floating point number between zero
and 1. It is closely related to galaxy type: it is 1 for a pure de
Vaucouleurs profile, typical of early-type galaxies, and zero for a
pure exponential profile, typical of late-type galaxies. eClass is a
spectroscopic parameter giving the spectral type from a principal
component analysis; it is a continuous value ranging from about -0.5
(early-type galaxies) to 1 (late-type galaxies).
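
As a minimal sketch of how the derived quantities described above follow from catalog values, the Python snippet below computes the four color indexes and the Petrosian concentration index, and applies the c = 2.5 split. The function names and the sample values are hypothetical and chosen only for illustration; this is not the SDSS pipeline, and the treatment of c exactly equal to 2.5 is an assumption, since the text leaves that case undefined.

```python
def color_indexes(u, g, r, i, z):
    """Return the four color indexes (u-g, g-r, r-i, i-z) from five magnitudes."""
    return (u - g, g - r, r - i, i - z)


def concentration_index(r50, r90):
    """Petrosian concentration index c = r90 / r50 (r-band radii)."""
    if r50 <= 0:
        raise ValueError("r50 must be positive")
    return r90 / r50


def galaxy_type(c, threshold=2.5):
    """Early-type for c > threshold, late-type otherwise.

    Assumption: c == threshold is assigned to "late"; the text only
    specifies c > 2.5 and c < 2.5.
    """
    return "early" if c > threshold else "late"


# Hypothetical, made-up values for a single galaxy (not real SDSS data).
colors = color_indexes(19.5, 18.0, 17.25, 16.75, 16.5)
c = concentration_index(r50=2.0, r90=6.0)
print(colors, c, galaxy_type(c))
```

Working with colors rather than raw magnitudes removes the overall brightness of the object from the input vector, which is why the text treats "color" and "ugriz" as distinct input patterns.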



3 KERNEL REGRESSION 
3.1 Overview of the algorithm 

Kernel regression (Watson 1964; Nadaraya 1964) belongs to the family
of instance-based learning algorithms, which simply store some or all
of the training examples and "delay learning" until prediction time.
Given a query point x_q, a prediction is obtained using the training
samples that are "most similar" to x_q. Similarity is measured by
means of a distance metric defined in the hyper-space of V predictor
variables. Kernel regressors obtain the prediction for a query point
x_q as a weighted average of the y values of its neighbors. The
weight of each neighbor is calculated by a function of its distance
to x_q (called the kernel function). These kernel functions give more
weight to neighbors that are nearer to x_q. The notion of
neighborhood (or bandwidth) is defined in terms of distance from x_q.
The prediction for query point x_q is obtained by

    y_q = \frac{\sum_{i=1}^{N} K\left(D(x_i, x_q)/h\right) y_i}{\sum_{i=1}^{N} K\left(D(x_i, x_q)/h\right)}    (1)

where D(.) is the distance function between two instances; K(.) is a
kernel function; h is a bandwidth value; (x_i, y_i) are training
samples; x_i and x_q are vectors; N is the number of data points used
in the model. In this paper, we use the Euclidean distance and a
Gaussian kernel function. x_i is the feature vector of each training
sample, y_i is the spectroscopic redshift of each training sample,
and y_q is the redshift of each query sample.
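
The weighted average described in this section can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the Gaussian kernel is taken in the unnormalized form exp(-t^2/2) (an assumption; any constant factor cancels in the ratio), and returning None when all weights vanish is also an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def gaussian_kernel(t):
    """Unnormalized Gaussian kernel; the normalization cancels in the ratio."""
    return math.exp(-0.5 * t * t)


def kernel_regress(x_train, y_train, x_query, h):
    """Weighted average of y_train with weights K(D(x_i, x_query) / h).

    Returns None when every weight underflows to zero, i.e. the query
    point has no training neighbor within reach of the bandwidth.
    """
    weights = [gaussian_kernel(euclidean(x, x_query) / h) for x in x_train]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * y for w, y in zip(weights, y_train)) / total


# Toy data: one-dimensional features and made-up target values.
x_train = [(0.0,), (1.0,), (2.0,)]
y_train = [0.1, 0.2, 0.3]
print(kernel_regress(x_train, y_train, (1.0,), h=0.5))
```

All the work happens at query time: nothing is fitted in advance, so changing h or adding training points requires no retraining step.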



3.2 Bandwidth determination 

One important design decision when using kernel regression is the
choice of the bandwidth h. A larger h results in a flatter weight
function curve, which means that many points of the training set
contribute quite evenly to the regression. As h tends to infinity,
the predictions approach the global average of all points in the
database. If h is very small, only closely neighboring data points
make a significant contribution. If the data are relatively noisy, we
expect to obtain smaller prediction errors with a relatively larger
h. If the data are noise free, then a small h will avoid smearing
away fine details in the function. There exist mature algorithms for
choosing the bandwidth for kernel regression that minimize a
statistical measure of the difference between the true underlying
distribution and the estimated distribution. Usually, bandwidth
selection in regression is done by cross-validation (CV) or the
penalized residual sum of squares.

Cross-validation is the statistical method of dividing a sample of
data into subsets such that the analysis is initially performed on a
single subset, while the other subset(s) are retained for subsequent
use in confirming and validating the initial analysis. M-fold
cross-validation is one important cross-validation method. The data
are divided into M subsets of (approximately) equal size. Each time,
one of the M subsets is used as the test set and the other M - 1
subsets are put together to form a training set. Cross-validation
chooses the bandwidth by minimizing the cross-validation score CV(h)
defined by

    CV(h) = \frac{1}{M} \left[ \frac{1}{k_1} \sum_{i=1}^{k_1} (y_{1i} - \hat{y}_{1i})^2 + \frac{1}{k_2} \sum_{i=1}^{k_2} (y_{2i} - \hat{y}_{2i})^2 + \cdots + \frac{1}{k_M} \sum_{i=1}^{k_M} (y_{Mi} - \hat{y}_{Mi})^2 \right]    (2)

where y_{ji} is the spectroscopic redshift of each test set sample,
\hat{y}_{ji} is the predicted photometric redshift of each test
sample, k_j is the number of objects in each subset (j = 1, 2, ...,
M), and M is the number of subsets for cross-validation. In general,
the k_j values are identical. Here we adopt 10-fold cross-validation
for the bandwidth choice, i.e. M = 10: we first divide the sample of
399,929 galaxies into 10 subsets, and then, ten times in turn, take 9
of the 10 subsets as the training set and the remaining subset as the
test set.
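
The bandwidth search described above can be sketched as follows. This is a minimal illustration on synthetic one-dimensional data, not the authors' implementation: the helper names, the candidate grid, the random seeds and the choice to skip test points that receive no estimate are all assumptions made for the example.

```python
import math
import random


def predict(train, x_query, h):
    """Gaussian-kernel weighted average over (x, y) pairs with scalar x."""
    weights = [math.exp(-0.5 * ((x - x_query) / h) ** 2) for x, _ in train]
    total = sum(weights)
    if total == 0.0:
        return None  # no neighbor within reach of the bandwidth
    return sum(w * y for w, (_, y) in zip(weights, train)) / total


def cv_score(data, h, m=10, seed=0):
    """M-fold CV score: average over folds of each fold's mean squared error."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[j::m] for j in range(m)]
    fold_errors = []
    for j, test in enumerate(folds):
        train = [p for k, fold in enumerate(folds) if k != j for p in fold]
        squared = []
        for x, y in test:
            y_hat = predict(train, x, h)
            if y_hat is not None:  # assumption: skip points with no estimate
                squared.append((y - y_hat) ** 2)
        if squared:
            fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / len(fold_errors)


# Synthetic noisy data (not SDSS): y = sin(x) + Gaussian noise.
rng = random.Random(1)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 6.0)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.1)))

candidates = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
scores = {h: cv_score(data, h) for h in candidates}
best_h = min(scores, key=scores.get)
print(best_h, scores[best_h])
```

Using the same fold assignment (the same seed) for every candidate h keeps the scores comparable, so differences between them reflect the bandwidth rather than the random split.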

We adopt the sample described in Section 2, using four color indexes
(u - g, g - r, r - i and i - z) and the spectroscopic redshifts as
input data. We then run kernel regression on this sample and compute
the 10-fold cross-validated score for different bandwidths, listed in
Table 1. As shown in Table 1, the cross-validated score CV(h) reaches
its minimum, 5.559 × 10^-4, when h is equal to 0.02. Therefore, 0.02
is adopted as the optimal fixed bandwidth for the sample in this
case.

Table 1. Bandwidth determination using the cross-validated (CV) method

    h        CV(h) (×10^-4)
    0.010    5.668
    0.015    ...
    0.020    5.559
    0.025    5.620
    0.030    5.725
    0.035    5.831
    0.040    5.973
    0.045    6.112
    0.050    6.264
    0.055    6.426
    0.060    6.601
    0.065    6.794
    0.070    6.990
    0.075    7.195
    0.080    7.410
    0.085    7.638
    0.090    7.877



3.3 Input pattern selection

In this work, we choose the input parameters using the Akaike
Information Criterion (AIC). AIC (Akaike 1974) is a measure of the
goodness of fit of an estimated statistical model. The AIC
methodology attempts to find the model that best explains the data
with a minimum of free parameters. In the general case, AIC is

    AIC = -2 \ln L_{max} + 2k    (3)

where L_max is the maximized likelihood function, and k is the number
of free parameters in the model.

The purpose of model selection is to identify the model that best
fits the available data set. One model is better than another if it
has a smaller AIC value, and the model with the lowest AIC value is
regarded as the best model. Several recent works in astrophysics have
used AIC for model selection (e.g. Liddle 2004, 2007). In Section
4.1, AIC is used to select the optimal input pattern.
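
To make the trade-off in the AIC definition concrete, the sketch below evaluates it under one specific, assumed likelihood: independent Gaussian residuals with the maximum-likelihood variance. The paper does not state which likelihood or which parameter count k was used, so the function name, the choice of k and the residuals are hypothetical, and the numbers are not expected to reproduce the AIC values reported later.

```python
import math


def gaussian_aic(residuals, k):
    """AIC = -2 ln L_max + 2k, assuming i.i.d. Gaussian residuals.

    With the maximum-likelihood variance s2 = RSS / n, the maximized
    log-likelihood is ln L_max = -(n / 2) * (ln(2 * pi * s2) + 1).
    """
    n = len(residuals)
    s2 = sum(e * e for e in residuals) / n
    log_l_max = -0.5 * n * (math.log(2.0 * math.pi * s2) + 1.0)
    return -2.0 * log_l_max + 2.0 * k


# Two hypothetical models evaluated on the same five objects (made-up residuals).
residuals_a = [0.02, -0.01, 0.03, -0.02, 0.01]   # model A, k = 4
residuals_b = [0.02, -0.01, 0.03, -0.02, 0.012]  # model B, k = 5, no better fit
aic_a = gaussian_aic(residuals_a, k=4)
aic_b = gaussian_aic(residuals_b, k=5)
print("preferred model:", "A" if aic_a < aic_b else "B")
```

In this toy case the extra parameter of model B is penalized without any compensating gain in fit, so model A has the smaller AIC.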

4 RESULTS AND DISCUSSION 
4.1 Results

One advantage of the empirical training-set approach to photometric
redshift estimation is that additional parameters can easily be
incorporated. More parameters (e.g. r50, r90, fracDeV) may be taken
as inputs. In order to study which parameters influence the accuracy
of predicting photometric redshifts, we try different input patterns
to estimate photometric redshifts. Following the bandwidth choice
criterion described in Section 3.2, we compute the 10-fold
cross-validation scores and obtain the optimal bandwidth for each
input pattern, as shown in Table 2. We use the AIC to determine which
input pattern is best.

When implementing kernel regression to predict photometric redshifts,
260,000 galaxies are randomly selected as the training set and the
rest serve as the test set. The rms deviations, optimal bandwidths
and AIC values for different input patterns are listed in Table 2.
Table 2 shows that the rms error differs from one input pattern to
another, as do the corresponding optimal bandwidth and AIC.
Nevertheless, AIC follows the same trend as the rms error, i.e. AIC
increases as the rms error increases and decreases as it declines.
The input pattern with the minimum AIC is considered the best input
pattern. As a result, the best input pattern is four colors (u - g,
g - r, r - i, i - z) and eClass, for which the rms error is 0.0189.
The second-best input pattern is five magnitudes and eClass, with an
rms error of 0.0198. The next is four colors and r magnitude, with an
rms scatter of 0.0206. The result with only five magnitudes is better
than that with only four colors but worse than that with four colors
and r magnitude. With five magnitudes as inputs, the performance of
kernel regression degrades when r50 and r90 or fracDeV_r are added;
only eClass improves it. Similarly, with four colors, or four colors
and r magnitude, as inputs, the performance does not improve when r50
and r90 or fracDeV_r are also considered. Adding the Petrosian
concentration index c does not improve the performance compared with
using only four colors and r magnitude as inputs. The result with
four colors and five magnitudes is superior to that with only colors
or only magnitudes; however, it is worse than that with four colors
and r magnitude. Therefore, when applying kernel regression to
predict photometric redshifts, we find that parameters other than
magnitudes and color indexes, such as r50, r90, fracDeV_r and c,
contribute little information, whereas eClass is important and
effective.

Table 2. rms errors, optimal bandwidths and AIC for different input
parameters

    Input Parameters*        σ_rms     h        AIC
    ugriz                    0.0215    0.025    64.259
    ugriz + r50 + r90        0.0247    0.070    84.282
    ugriz + fracDeV_r        0.0223    0.035    69.242
    ugriz + eClass           0.0198    0.025    54.548
    color                    0.0220    0.020    67.558
    color + r                0.0206    0.030    58.933
    color + r + c            0.0206    0.035    58.656
    color + r + r50 + r90    0.0226    0.050    70.206
    color + fracDeV_r        0.0220    0.025    67.149
    color + ugriz            0.0210    0.040    60.961
    color + eClass           0.0189    0.025    49.503

*NOTE: r50 is the Petrosian 50% radius in the r band, r90 is the
Petrosian 90% radius in the r band, fracDeV_r is fracDeV in the r
band, color denotes the color indexes, i.e. u - g, g - r, r - i,
i - z, and c = r90/r50.

Figure 1 compares the known spectroscopic redshifts with the
photometric redshifts calculated for the test data using kernel
regression with the input pattern color+eClass. With color+r as the
inputs, the fractions of predicted photometric redshifts lying beyond
the ±3σ and ±4σ error bars, including the loss of estimation, are
2.10% and 1.03%, respectively. With color+eClass as the inputs, the
corresponding fractions, again including the loss, are 2.11% and
1.28%, respectively. The loss of estimation refers to the points
whose photometric redshifts cannot be measured because their
distances to their neighbors lie beyond the optimal window width of
kernel regression.
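
The rms scatter and the outlier fractions quoted above can be computed along the following lines. This is a sketch under stated assumptions: function names and data are hypothetical, σ is taken to be the rms of the residuals over objects that have an estimate, and objects without an estimate (represented here by None) are counted as outliers, which is one possible reading of "including the loss of estimation".

```python
import math


def rms_scatter(z_spec, z_phot):
    """Root-mean-square of z_spec - z_phot over objects that have an estimate."""
    pairs = [(s, p) for s, p in zip(z_spec, z_phot) if p is not None]
    return math.sqrt(sum((s - p) ** 2 for s, p in pairs) / len(pairs))


def outlier_fraction(z_spec, z_phot, n_sigma):
    """Fraction of all objects whose residual exceeds n_sigma times the rms.

    Objects without an estimate (None) are counted in the numerator.
    """
    sigma = rms_scatter(z_spec, z_phot)
    bad = sum(
        1
        for s, p in zip(z_spec, z_phot)
        if p is None or abs(s - p) > n_sigma * sigma
    )
    return bad / len(z_spec)


# Made-up test set: 18 good estimates, one large error, one lost object.
z_spec = [0.1] * 20
z_phot = [0.11, 0.09] * 9 + [0.3, None]
print(rms_scatter(z_spec, z_phot))
print(outlier_fraction(z_spec, z_phot, 3), outlier_fraction(z_spec, z_phot, 5))
```

Because the lost objects enter only the numerator, the outlier fraction can exceed what the residual distribution alone would suggest when the bandwidth is small.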

Although eClass is not strictly photometric, this parameter can be
used to estimate photometric redshifts when galaxies have low S/N
spectra, or weak absorption or emission lines. Moreover, it is
helpful for the statistical study of a large galaxy sample without
detailed spectral information. In addition, eClass may be estimated
from color indexes or magnitudes, as follows. The parameter eClass is
a continuous parameter ranging from approximately -0.5 (early-type
galaxies) to 1 (late-type galaxies), indicating spectral type in the
SDSS spectroscopic catalog. We use the same sample to estimate
eClass, rather than redshifts, with kernel regression. Based on the
results listed in Table 2, we choose color+r, the best input pattern
among those without eClass. The rms scatter is σ_rms = 0.0337, as
shown in Figure 2. Other researchers have done similar work: for
example, Wadadekar (2005) used support vector machines (SVMs) to
predict the photometric eClass using 10,000 objects from SDSS Data
Release 2, obtaining an rms scatter of σ_rms = 0.057; Collister &
Lahav (2004) obtained σ_rms = 0.052 with artificial neural networks
(ANNs) for the eClass estimation with 64,175 objects from SDSS Data
Release 1.

Figure 1. Comparison between spectroscopic and photometric redshifts
(axis label: spectroscopic redshifts). 260,000 galaxies are used as
the training set and 139,929 galaxies as the test set (plotted). The
input parameters are u - g, g - r, r - i, i - z and eClass.

Figure 2. Spectroscopic eClass vs. calculated photometric eClass for
139,929 galaxies from the SDSS DR5 with kernel regression (axis
label: spectroscopic eClass). The input parameters are u - g, g - r,
r - i, i - z and r.

Table 3. Comparison of the accuracy for the separated sample with
that for the original sample

    Input Parameters    σ_rms^E    AIC^E    σ_rms^L    AIC^L    σ_rms^M    σ_rms
    color               0.0197     38.52    0.0247     35.79    0.0215     0.0220
    color+r             0.0186     36.96    0.0230     33.86    0.0204     0.0206
    color+eClass        0.0164     30.33    0.0222     31.94    0.0187     0.0189

NOTE: σ_rms^E is σ_rms for early-type galaxies; σ_rms^L is σ_rms for
late-type galaxies; σ_rms^M is σ_rms for the whole (mixed) sample;
AIC^E and AIC^L are the AIC values for early-type and late-type
galaxies, respectively; σ_rms is taken from Table 2.

From Table 2, we can conclude that spectral type is an important
parameter for determining photometric redshifts. In order to study
further how the spectral type influences the accuracy of measuring
photometric redshifts, the sample is divided into two parts according
to the criterion that early-type galaxies have c > 2.5 and late-type
galaxies have c < 2.5 (Strateva et al. 2001). This yields 251,794
early-type galaxies and 148,135 late-type galaxies in our sample. We
then run kernel regression on the two sets separately. Taking u - g,
g - r, r - i, i - z as inputs and h = 0.02, the rms dispersion of
photometric redshifts is σ_rms = 0.0197 for early-type galaxies and
σ_rms = 0.0247 for late-type galaxies, and the rms scatter for the
mixed sample (σ_rms^M) is 0.0215. σ_rms^M is computed according to
Equation (4).



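    \sigma_{rms}^{M} = \sqrt{\frac{\sum_{i=1}^{N_1} (y_i^E - \hat{y}_i^E)^2 + \sum_{i=1}^{N_2} (y_i^L - \hat{y}_i^L)^2}{N_1 + N_2}}    (4)

where y_i^E and y_i^L are the spectroscopic redshifts of early-type
and late-type galaxies, respectively; \hat{y}_i^E and \hat{y}_i^L are
the predicted photometric redshifts of early-type and late-type
galaxies, respectively; N_1 is the number of early-type galaxies; and
N_2 is the number of late-type galaxies.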
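
One reading of Equation (4) is a pooled rms over the residuals of both subsamples. The sketch below, with hypothetical function names and made-up residuals, shows that this pooled value equals the rms of the concatenated residuals and can also be obtained from the per-class rms values and counts alone.

```python
import math


def rms(residuals):
    """Root-mean-square of a list of residuals."""
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))


def mixed_rms(res_early, res_late):
    """Pooled rms: all squared residuals summed and divided by N1 + N2."""
    n1, n2 = len(res_early), len(res_late)
    total = sum(e * e for e in res_early) + sum(e * e for e in res_late)
    return math.sqrt(total / (n1 + n2))


def mixed_rms_from_summary(sigma1, n1, sigma2, n2):
    """Same quantity computed from per-class rms values and counts."""
    return math.sqrt((n1 * sigma1 ** 2 + n2 * sigma2 ** 2) / (n1 + n2))


# Made-up residuals (spectroscopic minus photometric redshift) for two classes.
res_early = [0.01, -0.02, 0.02]
res_late = [0.03, -0.03]
pooled = mixed_rms(res_early, res_late)
print(pooled)
print(mixed_rms_from_summary(rms(res_early), 3, rms(res_late), 2))
```

The pooled value always lies between the two per-class values, weighted toward the larger subsample.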

Taking u - g, g - r, r - i, i - z and r as inputs and h = 0.03, the
rms error of photometric redshifts is σ_rms = 0.0186 for early-type
galaxies and σ_rms = 0.0230 for late-type galaxies, and the mixed rms
error is 0.0204. With four color indexes and eClass as inputs and
h = 0.025, the rms scatter is σ_rms = 0.0164 for early-type galaxies
and σ_rms = 0.0222 for late-type galaxies, and the mixed rms error is
0.0187. The rms scatter obtained with the sample split into two parts
is lower than that obtained without separating the sample, as shown
in Table 3. For early-type galaxies, the rms deviation of the
photometric redshift measurement is very satisfactory. Table 3
further indicates that eClass, a parameter related to spectral type,
is a robust and significant parameter for determining photometric
redshifts, and that separating galaxies into early types and late
types also helps to improve the accuracy of photometric redshifts. In
addition, for both early-type and late-type galaxies, the AIC values
reach their minimum with color+eClass as the inputs. Therefore, in
our case, color+eClass is the best input pattern for determining
photometric redshifts, while color+r is the second best.



4.2 Discussion

Many algorithms have been developed for determining photometric
redshifts, and each method has its pros and cons. For ANNs, we need
to decide on the optimal network architecture. More complex network
architectures give more accurate results: they allow a closer fit to
the data, but are subject to the danger of overfitting. In addition,
when layers or nodes are added to the network, the training time
increases remarkably (Wadadekar 2005). Compared to ANNs, SVMs
simplify the training process, since one only needs to choose the
kernel function rather than the architecture. Even a simple Gaussian
function can give good performance. However, adjusting the many
parameters requires prior knowledge.



Table 4. Various photometric redshift approaches and accuracies

    Method Name          σ_rms     Data set           Input parameters
    CWW^1                0.0666    SDSS-EDR           ugriz
    Bruzual-Charlot^1    0.0552    SDSS-EDR           ugriz
    Interpolated^1       0.0451    SDSS-EDR           ugriz
    Polynomial^1         0.0318    SDSS-EDR           ugriz
    Kd-tree^1            0.0254    SDSS-EDR           ugriz
    ClassX^2             0.0340    SDSS-DR2           ugriz
    SVMs^3               0.027     SDSS-DR2           ugriz
                         0.0230    SDSS-DR2           ugriz + r50 + r90
    ANNs^4               0.0229    SDSS-DR1           ugriz
    Polynomial^5         0.025     SDSS-DR1, GALEX    ugriz + nuv
    Kernel Regression    0.0215    SDSS-DR5           ugriz
                         0.0206    SDSS-DR5           color+r
                         0.0189    SDSS-DR5           color+eClass

NOTE: SDSS-EDR = Early Data Release (Stoughton et al. 2002),
SDSS-DR1 = Data Release 1 (Abazajian et al. 2003), SDSS-DR2 = Data
Release 2 (Abazajian et al. 2004), SDSS-DR5 = Data Release 5
(Adelman-McCarthy et al. 2007). r50 is the Petrosian 50% radius in
the r band, r90 is the Petrosian 90% radius in the r band, fracDeV_r
is fracDeV in the r band, color denotes the color indexes, i.e.
u - g, g - r, r - i, i - z.
(1) Csabai et al. 2003; (2) Suchkov, Hanisch & Margon 2005;
(3) Wadadekar 2005; (4) Collister & Lahav 2004; (5) Budavári et al. 2005.

Correlation between parameters makes the tuning process more
complicated. Although linear or non-linear polynomial regression is
easy to communicate to astronomers, its systematic deviation is large
(Brunner et al. 1997; Wang, Bahcall & Turner 1998; Budavári et al.
2005; Hsieh et al. 2005; Connolly et al. 1995). In recent years, a
combination of HyperZ with Bayesian marginalization was proposed by
Benitez (2000). The dispersion of photometric redshifts obtained with
this combined technique was significantly improved. Although the
Bayesian technique improves the results, its application can
introduce unrealistic effects in some studies. Therefore, this
approach can be an alternative option when no spectral data are
available.

As large and deep photometric surveys are carried out, it seems that
kernel regression will offer some significant advantages over other
approaches, as shown in Table 4. The performance of kernel regression
in predicting photometric redshifts is comparable to that of ANNs and
SVMs, superior to that of Kd-tree, ClassX and polynomial regression,
and preferable to that of CWW and Bruzual-Charlot (Wadadekar 2005;
Collister & Lahav 2004; Csabai et al. 2003; see Table 1 of each). A
major problem for empirical training-set methods is the difficulty of
extrapolating to regions where the input parameters are not well
represented by the training data. With kernel regression, however,
even when only a few high-redshift galaxies exist in the sample, one
can adjust the bandwidth appropriately to obtain much more accurate
redshifts. In addition, compared to other training-set methods,
kernel regression has the further advantage that it needs no
retraining when a new query point appears.



5 CONCLUSION 

We have presented an instance-based learning method called kernel
regression to predict photometric redshifts of galaxies with data
from SDSS broadband photometry. An important task in kernel
regression is determining the bandwidth. We use 10-fold
cross-validation to choose the optimal bandwidth. Our experiments
show that the optimal bandwidth differs for different input
parameters. The color+eClass pattern is the best, with an rms error
of photometric redshift estimation of 0.0189, and ugriz+eClass is the
second best, with an rms error of 0.0198. Apart from these two
patterns, the color+r pattern is the best, with an rms scatter of
0.0206. Parameters such as r50, r90, fracDeV_r and c contribute
little information, whereas eClass is very important. Moreover,
kernel regression achieves high accuracy in predicting photometric
redshifts for early-type galaxies and the photometric eClass. For
ANNs, the more parameters are considered, the higher the accuracy of
photometric redshifts (Way & Srivastava 2006; Li et al. 2006),
whereas for kernel regression and SVMs the accuracy is satisfactory
only when appropriate parameters are chosen. Kernel regression is
thus able to measure photometric redshifts of galaxies accurately.
This helps in constructing galaxy samples for the study of cosmology
with minimal contamination from objects at seriously incorrect
redshifts. Similarly, kernel regression may be applied to predict
photometric redshifts of quasars.

Kernel regression is flexible in a number of ways. It is possible to
make different queries with not only different kernel widths h, but
also different distance metrics, for example with subsets of
attributes ignored, or with other distance metrics such as the
Manhattan distance or the Canberra distance. It is also possible to
apply the same technique with different kernel functions for
classification instead of regression. Unlike traditional training
methods, its greatest merit is the ability to make predictions with
different parameters without needing a retraining phase; moreover, it
does not depend strongly on the size of the sample. Nevertheless, it
has the obvious disadvantage of instance-based learning, namely a
significant computational cost on large data sets. In future work we
will explore different kernel functions and other kinds of distance
metric for kernel regression on regression problems. In addition, we
may use multiresolution instance-based learning as suggested by Deng
& Moore (1995). This method succeeds in reducing the cost of
instance-based learning, and it has two further advantages: the
flexibility to work across local and global data, and the ability to
make predictions with different parameters without needing a
retraining phase.
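
The flexibility described above, i.e. swapping the distance metric or ignoring a subset of attributes at query time, can be illustrated with a small sketch. The function names, the `keep` argument and the toy data are hypothetical; the Canberra distance is written in its usual form, the sum of |x - y| / (|x| + |y|), with 0/0 terms counted as zero.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0.0:  # a 0/0 term is conventionally counted as 0
            total += abs(x - y) / denom
    return total


def kernel_regress(x_train, y_train, x_query, h, distance=euclidean, keep=None):
    """Gaussian-kernel weighted average with a pluggable distance metric.

    keep is an optional list of attribute indexes; other attributes are ignored.
    """
    def select(v):
        return v if keep is None else [v[i] for i in keep]

    q = select(x_query)
    weights = [math.exp(-0.5 * (distance(select(x), q) / h) ** 2) for x in x_train]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * y for w, y in zip(weights, y_train)) / total


# Toy two-attribute data with made-up target values.
x_train = [(0.5, 1.0), (1.0, 2.0), (1.5, 9.0)]
y_train = [0.1, 0.2, 0.3]
for metric in (euclidean, manhattan, canberra):
    print(metric.__name__, kernel_regress(x_train, y_train, (1.0, 2.0), h=0.5, distance=metric))
print("first attribute only:", kernel_regress(x_train, y_train, (1.0, 2.0), h=0.5, keep=[0]))
```

None of these variations touches the stored training data, which is the sense in which no retraining phase is needed.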